# Customer Churn Prediction in the Telecom Industry: A Machine Learning Approach
Customer churn, the discontinuation of a company's services, poses a major challenge in the telecom industry. With annual churn rates between 15% and 25%, reducing customer attrition is a strategic priority, as retaining existing customers is far more cost-effective than acquiring new ones.
## Objectives
This analysis aims to:
- Explore churn patterns across customer demographics, service types, and usage behavior.
- Identify key factors contributing to churn by analyzing correlations and feature importance.
- Build and evaluate predictive models, including Logistic Regression, Decision Trees, K-Nearest Neighbors (KNN), and ensemble methods.
- Compare model performance using evaluation metrics to determine the most effective approach for churn prediction.
By leveraging machine learning, this study provides insight into customer churn trends, helping telecom companies identify high-risk customers and improve retention efforts.
Dataset: customer_churn_dataset.csv
Evaluation metrics: Precision, Recall, F1-score, and ROC-AUC
## REQUIRED LIBRARIES
# For data wrangling
import pandas as pd
import numpy as np
# For visualization
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.offline import init_notebook_mode
# For preprocessing and modeling
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
#Model building
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, classification_report, confusion_matrix, ConfusionMatrixDisplay
init_notebook_mode(connected=True)
## Exploratory Data Analysis (EDA)
- Understanding the dataset distribution
- Checking for missing values and outliers
- Identifying feature correlations
- Finding key patterns in churn behavior
df = pd.read_csv('customer_churn_dataset.csv')
df.head(2)
|   | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cust_1 | Male | 0.0 | Yes | No | 2.0 | Yes | No | NaN | No | No internet service | 1 |
| 1 | Cust_2 | Female | 1.0 | No | No | NaN | Yes | No | Fiber optic | Yes | Yes | 0 |
df.shape
(10000, 12)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   customerID       10000 non-null  object
 1   gender           9000 non-null   object
 2   SeniorCitizen    9000 non-null   float64
 3   Partner          9000 non-null   object
 4   Dependents       9000 non-null   object
 5   tenure           9000 non-null   float64
 6   PhoneService     9000 non-null   object
 7   MultipleLines    9000 non-null   object
 8   InternetService  9000 non-null   object
 9   OnlineSecurity   9000 non-null   object
 10  OnlineBackup     9000 non-null   object
 11  Churn            10000 non-null  int64
dtypes: float64(2), int64(1), object(9)
memory usage: 937.6+ KB
df.isnull().sum()
customerID            0
gender             1000
SeniorCitizen      1000
Partner            1000
Dependents         1000
tenure             1000
PhoneService       1000
MultipleLines      1000
InternetService    1000
OnlineSecurity     1000
OnlineBackup       1000
Churn                 0
dtype: int64
churned_out_color = '#B71C1C'
active_customers_color = '#00BFA5'
# Data Visualization and Exploration
# Prepare the data
labels = ['Churned Out', 'Active Customers']
sizes = [(df['Churn'] == 1).sum(), (df['Churn'] == 0).sum()]
print(sizes)
# Create the pie chart
fig = px.pie(
    names=labels,
    values=sizes,
    title="Proportion of Customers Churned Out and Active Customers",
    hole=0.0,  # 0.0 for a standard pie chart; set hole=0.5 for a donut chart
)
# Optional: tune the visual appearance
fig.update_traces(
    pull=[0, 0.05],  # Pull the 'Active Customers' slice out slightly (like "explode")
    textinfo='percent+label',         # Show percentage and label together
    hoverinfo='label+percent+value',  # Hover information
    marker=dict(line=dict(color='black', width=0.5),
                colors=[churned_out_color, active_customers_color]),
)
# Adjust the layout to set the width and height
fig.update_layout(
    width=800,
    height=500
)
# Show the chart
fig.show()
[5020, 4980]
# Prepare data for analysis and exploration
# - Create a copy of the original DataFrame for exploratory data analysis (EDA)
# - Remove the 'customerID' column as it is irrelevant for modeling
# - Map categorical values in 'Churn' and 'SeniorCitizen' columns to more meaningful labels
# for better readability and interpretation
df_copy = df.copy()
# Drop the customerID column
if 'customerID' in df.columns:
df = df.drop(columns=['customerID'])
# Drop the customerID column
if 'customerID' in df_copy.columns:
df_copy = df_copy.drop(columns=['customerID'])
# Map the Churn column to the desired labels in the copy
df_copy['Churn'] = df_copy['Churn'].map({0: 'Active Customers', 1: 'Churned Out'})
df_copy['SeniorCitizen'] = df_copy['SeniorCitizen'].map({0: 'Non-Senior Citizen', 1: 'Senior Citizen'})
# Plot the churn distribution for each categorical feature
categorical_features = ['gender', 'SeniorCitizen', 'Partner', 'Dependents',
                        'PhoneService', 'MultipleLines', 'InternetService',
                        'OnlineSecurity', 'OnlineBackup']
for feature in categorical_features:
    fig = px.histogram(df_copy,
                       x=feature,
                       color='Churn',
                       title=f'Churn Rate by {feature}',
                       barmode='group',
                       color_discrete_sequence=[churned_out_color, active_customers_color])
    # Label the x-axis with the feature being plotted
    fig.update_layout(xaxis_title=feature, yaxis_title='Count', width=800, height=400)
    fig.show()
#tenure
# Group and aggregate data
grouped_data = df_copy.groupby(['tenure', 'Churn']).size().reset_index(name='Customer Count')
# Create the line chart
fig = px.line(
grouped_data,
x='tenure',
y='Customer Count',
color='Churn',
title='Churn Rate by Tenure',
color_discrete_sequence=[active_customers_color,churned_out_color]
)
# Update layout for better labels
fig.update_layout(
xaxis_title='Tenure',
yaxis_title='Customer Count',
legend_title='Churn Status',
)
# Show the figure
fig.show()
## Key Observations from Customer Churn Analysis
We note the following insights from the visualizations:
**Churn rate is nearly 50%**
- The dataset contains 5,020 churned customers and 4,980 active customers, so the classes are nearly balanced and churn prediction is a meaningful task.
**Most features show similar distributions**
- gender, Partner, Dependents, PhoneService, MultipleLines, OnlineSecurity, and OnlineBackup all have nearly equal proportions between churned and active customers.
- This suggests that these individual features alone are not strong predictors of churn.
**Tenure shows a clear pattern**
- Customers with shorter tenure (0-20 months) exhibit higher churn rates, indicating that early-stage customers are more likely to leave.
- Churn fluctuates but stabilizes beyond 30 months, with intermittent spikes.
- Investigating contract renewals, pricing changes, or service issues at these spikes can provide deeper insight.
- Retention strategies should focus on early-tenure customers, for example through personalized offers or improved onboarding.
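The tenure pattern above can be quantified by bucketing tenure and computing the churn rate per bucket. The sketch below uses a small hypothetical frame in place of `df_copy` (with 0/1 churn flags, as in the raw data) purely to illustrate the `pd.cut` + `groupby` pattern:

```python
import pandas as pd

# Toy stand-in for df_copy (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'tenure': [2, 5, 12, 18, 25, 33, 40, 55, 60, 70],
    'Churn':  [1, 1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# Bucket tenure into ranges and compute the churn rate within each bucket
toy['tenure_bucket'] = pd.cut(toy['tenure'],
                              bins=[0, 20, 40, 60, 80],
                              labels=['0-20', '21-40', '41-60', '61-80'])
churn_rate = toy.groupby('tenure_bucket', observed=True)['Churn'].mean()
print(churn_rate)
```

On the real dataset, the same two lines applied to `df_copy` would give the churn rate per tenure band directly, instead of eyeballing it from the line chart.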
# Drop rows with missing values
df_copy = df_copy.dropna()
# Encode categorical variables
label_encoders = {}
for column in df_copy.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    df_copy[column] = le.fit_transform(df_copy[column])
    label_encoders[column] = le
# Compute the correlation matrix
correlation_matrix = df_copy.corr()
# Plot the heatmap
plt.figure(figsize=(10, 5))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()
## Understanding the Correlation Matrix
The correlation matrix shows how features relate to each other and to churn. Key takeaways:
**No strong correlation with churn**
- All features have low correlation with churn, meaning no single feature alone is a strong predictor.
- Tenure shows a slight negative correlation, indicating that customers with longer tenure are less likely to churn.
**Minimal multicollinearity**
- No two features are highly correlated, so redundant features are unlikely.
- This suggests feature interactions may matter more than individual features.
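A quick programmatic check for multicollinearity is to zero out the diagonal of the absolute correlation matrix and look at the largest remaining entry. The sketch below demonstrates the idea on synthetic data (the variable names are illustrative, not from the notebook):

```python
import numpy as np
import pandas as pd

# Synthetic features: make columns 'a' and 'b' deliberately related
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(200, 4)), columns=['a', 'b', 'c', 'd'])
toy['b'] = toy['a'] * 0.9 + rng.normal(scale=0.5, size=200)

corr = toy.corr().abs()
corr_offdiag = corr.mask(np.eye(len(corr), dtype=bool))  # drop self-correlations
strongest_pair = corr_offdiag.stack().idxmax()           # most correlated pair
max_corr = corr_offdiag.stack().max()
print(strongest_pair, round(max_corr, 3))
```

Applied to the encoded `df_copy`, a small `max_corr` would confirm the "minimal multicollinearity" reading of the heatmap numerically.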
## Why Analyze Feature Importance?
Since correlation alone doesn't tell us how much each feature contributes to churn, we evaluate feature importance to:
- Identify which features have the most impact on predictions.
- Go beyond simple correlations by capturing non-linear relationships.
- Prioritize key factors to improve churn modeling and business strategies.
To achieve this, we use RandomForestClassifier, which ranks features based on their contribution to the model's decisions. This helps confirm whether features like tenure are indeed the strongest predictors.
# Preprocess the data
# (df_copy was already cleaned and label-encoded above, so these steps are
# no-ops here; they are repeated only so this cell can run on its own)
df_copy = df_copy.dropna()
label_encoders = {}
for column in df_copy.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    df_copy[column] = le.fit_transform(df_copy[column])
    label_encoders[column] = le
# Split the data into features and target
X = df_copy.drop('Churn', axis=1)
y = df_copy['Churn']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a random forest classifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
# Get feature importance
feature_importance = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
# Print feature importance
print(feature_importance)
tenure             0.429750
OnlineSecurity     0.092899
OnlineBackup       0.086982
MultipleLines      0.082934
InternetService    0.081157
Partner            0.052440
gender             0.049084
PhoneService       0.045417
SeniorCitizen      0.044344
Dependents         0.034993
dtype: float64
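Impurity-based importances like the ones above can be cross-checked with permutation importance: shuffle one feature at a time and measure how much a fixed model's accuracy drops. The sketch below implements the idea from scratch on synthetic data, with a stand-in `predict` function (all names here are illustrative assumptions, not the notebook's model):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: the label depends on column 0 only; column 1 is pure noise
X = rng.normal(size=(1000, 2))
y = (X[:, 0] > 0).astype(int)

def predict(X):
    # Stand-in "model": predict churn when feature 0 is positive
    return (X[:, 0] > 0).astype(int)

def permutation_importance(X, y, predict, n_repeats=10):
    base_acc = (predict(X) == y).mean()
    importances = []
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            rng.shuffle(X_perm[:, j])  # break the feature-target link
            drops.append(base_acc - (predict(X_perm) == y).mean())
        importances.append(np.mean(drops))
    return np.array(importances)

imp = permutation_importance(X, y, predict)
print(imp)  # large drop for feature 0, no drop for the noise feature
```

scikit-learn ships an equivalent utility (`sklearn.inspection.permutation_importance`) that could be applied to the fitted `clf` here.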
## Why Create Baseline Models?
Before building a complex model, it's essential to establish baseline performance using simpler models. This helps in:
- Setting a reference point: measures improvement when testing more advanced models.
- Identifying initial patterns: even simple models can highlight key predictive features.
- Balancing interpretability and performance: Decision Trees and Logistic Regression provide insight into feature importance and separability.
## Baseline Models: Decision Tree & Logistic Regression
To create a solid starting point, we train two different models:
1. Decision Tree Classifier
   - Captures non-linear relationships and feature interactions.
   - Helps identify key decision-making splits for churn prediction.
2. Logistic Regression
   - A simple, interpretable model that provides probabilities of churn.
   - Acts as a benchmark to compare against more complex models.
## Key Metrics Evaluated
We evaluate both models using:
- Accuracy: overall correctness.
- Precision: how many predicted churns were correct.
- Recall: how many actual churn cases were detected.
- F1 Score: the harmonic mean of precision and recall.
These baselines allow us to compare future models and ensure that advanced techniques actually provide real improvements over simpler methods.
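The four metrics listed above all derive from the confusion-matrix counts. A worked example with hypothetical counts (chosen only for illustration) makes the formulas concrete:

```python
# Hypothetical confusion-matrix counts for a churn model
tp, fp, fn, tn = 40, 15, 25, 120

accuracy  = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)  # of predicted churners, how many actually churned
recall    = tp / (tp + fn)  # of actual churners, how many were caught
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.800 precision=0.727 recall=0.615 f1=0.667
```

Note that F1 simplifies to `2*tp / (2*tp + fp + fn)`, which is why it penalizes both false positives and false negatives.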
#Creating baseline models
# Preprocess the data (assuming df_copy is already preprocessed and ready)
# Split the data into features and target
x = df_copy.drop('Churn', axis=1)
y = df_copy['Churn']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
# Train a Decision Tree classifier
dt_clf = DecisionTreeClassifier(random_state=42, criterion='entropy', max_depth=5)
dt_clf.fit(X_train, y_train)
# Make predictions on the test set
y_pred = dt_clf.predict(X_test)
# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1_score_baseline_dt = f1_score(y_test, y_pred)
print(f'Accuracy of the DecisionTreeClassifier model: {accuracy:.3f}')
print(f'Precision of the DecisionTreeClassifier model: {precision:.3f}')
print(f'Recall of the DecisionTreeClassifier model: {recall:.3f}')
print(f'F1 Score of the DecisionTreeClassifier model: {f1_score_baseline_dt:.3f}')
Accuracy of the DecisionTreeClassifier model: 0.514
Precision of the DecisionTreeClassifier model: 0.488
Recall of the DecisionTreeClassifier model: 0.413
F1 Score of the DecisionTreeClassifier model: 0.447
#Creating baseline models
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Preprocess the data (assuming df_copy is already preprocessed and ready)
# Split the data into features and target
x = df_copy.drop('Churn', axis=1)
y = df_copy['Churn']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
# Train a Logistic Regression classifier
lr_clf = LogisticRegression(random_state=42, max_iter=500)
lr_clf.fit(X_train, y_train)
# Make predictions on the test set
y_pred = lr_clf.predict(X_test)
# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1_score_baseline_lr = f1_score(y_test, y_pred)
print(f'Accuracy of the Logistic Regression model: {accuracy:.3f}')
print(f'Precision of the Logistic Regression model: {precision:.3f}')
print(f'Recall of the Logistic Regression model: {recall:.3f}')
print(f'F1 Score of the Logistic Regression model: {f1_score_baseline_lr:.3f}')
Accuracy of the Logistic Regression model: 0.559
Precision of the Logistic Regression model: 0.543
Recall of the Logistic Regression model: 0.461
F1 Score of the Logistic Regression model: 0.498
## Data Cleaning & Preprocessing
- Handling missing values
- Encoding categorical variables
- Feature selection and scaling
# Filling up the missing values
#Gender
missing_gender_percent = df['gender'].isnull().sum() / len(df) * 100
print(f"Missing Gender Values: {missing_gender_percent:.2f}%")
df.loc[df['gender'].isnull(), 'gender'] = "Unknown"
#Senior Citizen
missing_senior_citizen_percent = df['SeniorCitizen'].isnull().sum() / len(df) * 100
print(f"Missing SeniorCitizen Values: {missing_senior_citizen_percent:.2f}%")
senior_dist = df['SeniorCitizen'].value_counts(normalize=True)
mask = df['SeniorCitizen'].isnull()
# Draw one value per missing row (size=mask.sum()), sampling categories from the
# distribution's own index so each probability lines up with its category
df.loc[mask, 'SeniorCitizen'] = np.random.choice(senior_dist.index, size=mask.sum(), p=senior_dist.values)
#Partner
missing_partner = df['Partner'].isnull().sum() / len(df) * 100
print(f"Missing Partner Values: {missing_partner:.2f}%")
partner_dist = df['Partner'].value_counts(normalize=True)
mask = df['Partner'].isnull()
df.loc[mask, 'Partner'] = np.random.choice(partner_dist.index, size=mask.sum(), p=partner_dist.values)
#Dependents
missing_dependents = df['Dependents'].isnull().sum() / len(df) * 100
print(f"Missing Dependents Values: {missing_dependents:.2f}%")
dependent_dist = df['Dependents'].value_counts(normalize=True)
mask = df['Dependents'].isnull()
df.loc[mask, 'Dependents'] = np.random.choice(dependent_dist.index, size=mask.sum(), p=dependent_dist.values)
#Tenure
missing_tenure = df['tenure'].isnull().sum() / len(df) * 100
print(f"Missing Tenure Values: {missing_tenure:.2f}%")
df.loc[df['tenure'].isnull(), 'tenure'] = df['tenure'].median()
#Phone Service
missing_phone_service = df['PhoneService'].isnull().sum() / len(df) * 100
print(f"Missing PhoneService Values: {missing_phone_service:.2f}%")
phone_service_dist = df['PhoneService'].value_counts(normalize=True)
mask = df['PhoneService'].isnull()
df.loc[mask, 'PhoneService'] = np.random.choice(phone_service_dist.index, size=mask.sum(), p=phone_service_dist.values)
#Multiple Lines
missing_multiple_lines = df['MultipleLines'].isnull().sum() / len(df) * 100
print(f"Missing MultipleLines Values: {missing_multiple_lines:.2f}%")
multiple_lines_dist = df['MultipleLines'].value_counts(normalize=True)
mask = df['MultipleLines'].isnull()
df.loc[mask, 'MultipleLines'] = np.random.choice(multiple_lines_dist.index, size=mask.sum(), p=multiple_lines_dist.values)
#Internet Service
missing_internet_service = df['InternetService'].isnull().sum() / len(df) * 100
print(f"Missing InternetService Values: {missing_internet_service:.2f}%")
internet_service_dist = df['InternetService'].value_counts(normalize=True)
mask = df['InternetService'].isnull()
df.loc[mask, 'InternetService'] = np.random.choice(internet_service_dist.index, size=mask.sum(), p=internet_service_dist.values)
#Online Security
missing_online_security = df['OnlineSecurity'].isnull().sum() / len(df) * 100
print(f"Missing OnlineSecurity Values: {missing_online_security:.2f}%")
online_security_dist = df['OnlineSecurity'].value_counts(normalize=True)
mask = df['OnlineSecurity'].isnull()
df.loc[mask, 'OnlineSecurity'] = np.random.choice(online_security_dist.index, size=mask.sum(), p=online_security_dist.values)
#Online Backup
missing_online_backup = df['OnlineBackup'].isnull().sum() / len(df) * 100
print(f"Missing OnlineBackup Values: {missing_online_backup:.2f}%")
online_backup_dist = df['OnlineBackup'].value_counts(normalize=True)
mask = df['OnlineBackup'].isnull()
df.loc[mask, 'OnlineBackup'] = np.random.choice(online_backup_dist.index, size=mask.sum(), p=online_backup_dist.values)
Missing Gender Values: 10.00%
Missing SeniorCitizen Values: 10.00%
Missing Partner Values: 10.00%
Missing Dependents Values: 10.00%
Missing Tenure Values: 10.00%
Missing PhoneService Values: 10.00%
Missing MultipleLines Values: 10.00%
Missing InternetService Values: 10.00%
Missing OnlineSecurity Values: 10.00%
Missing OnlineBackup Values: 10.00%
## Handling Missing Values
**Gender**
- Missing values are replaced with "Unknown" instead of imputing a category.
- Why? Gender is categorical and its missing values are not predictable, so it is better to keep them explicit rather than introduce bias.
**Senior Citizen**
- Filled probabilistically based on the distribution of existing values.
- Why? Maintains the real-world proportion instead of defaulting to a specific class.
**Partner & Dependents**
- Filled probabilistically based on the existing "Yes"/"No" ratio.
- Why? Prevents over-representing either category and keeps the data patterns realistic.
**Tenure**
- Filled with the median instead of the mean.
- Why? The median is less sensitive to outliers, giving a more balanced distribution.
**Phone Service & Multiple Lines**
- Filled probabilistically using the distribution of available values.
- Why? Helps maintain the service adoption rate in the dataset.
**Internet Service**
- Filled probabilistically using the existing category proportions.
- Why? Keeps the distribution of service types realistic.
**Online Security & Online Backup**
- Filled probabilistically based on category frequencies.
- Why? Retains natural variation rather than over-sampling any single category.
**Why is probabilistic filling better?**
- Prevents bias: avoids over-representing any one category.
- Mimics real-world patterns: missing values are filled according to the observed distribution.
- More accurate predictions: models learn from a dataset that reflects actual trends.
Now the dataset is clean, consistent, and ready for analysis!
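The per-column filling done above follows one repeated pattern, so it can be factored into a small helper. This is a sketch; `fill_from_distribution` is a hypothetical name, and a seeded generator is used for reproducibility:

```python
import numpy as np
import pandas as pd

def fill_from_distribution(series, rng=None):
    """Fill NaNs by sampling from the column's observed value distribution."""
    rng = rng or np.random.default_rng(0)
    dist = series.value_counts(normalize=True)
    out = series.copy()
    mask = out.isnull()
    # One draw per missing row, weighted by the observed category frequencies
    out[mask] = rng.choice(dist.index, size=mask.sum(), p=dist.values)
    return out

s = pd.Series(['Yes', 'No', 'Yes', None, 'Yes', None, 'No', None])
filled = fill_from_distribution(s)
print(filled.tolist())
```

With a helper like this, each categorical column could be imputed in a single loop (e.g. `df[col] = fill_from_distribution(df[col])`), avoiding the repeated boilerplate.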
df.isnull().sum()
gender             0
SeniorCitizen      0
Partner            0
Dependents         0
tenure             0
PhoneService       0
MultipleLines      0
InternetService    0
OnlineSecurity     0
OnlineBackup       0
Churn              0
dtype: int64
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   gender           10000 non-null  object
 1   SeniorCitizen    10000 non-null  float64
 2   Partner          10000 non-null  object
 3   Dependents       10000 non-null  object
 4   tenure           10000 non-null  float64
 5   PhoneService     10000 non-null  object
 6   MultipleLines    10000 non-null  object
 7   InternetService  10000 non-null  object
 8   OnlineSecurity   10000 non-null  object
 9   OnlineBackup     10000 non-null  object
 10  Churn            10000 non-null  int64
dtypes: float64(2), int64(1), object(8)
memory usage: 859.5+ KB
## Encoding the Data
# Create a LabelEncoder object for binary features
df.head()
# List of binary columns (for Label Encoding)
binary_cols = ['SeniorCitizen', 'Partner', 'Dependents', 'PhoneService']
# Apply Label Encoding to binary features
le = LabelEncoder()
for col in binary_cols:
    df[col] = le.fit_transform(df[col])
# List of categorical columns (for One-Hot Encoding)
categorical_cols = ['gender', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup']
# Apply One-Hot Encoding
df_preprocessed = pd.get_dummies(df, columns=categorical_cols, drop_first=False, dtype='int')
# Initialize MinMaxScaler
scaler = MinMaxScaler()
# Apply MinMaxScaler to the 'tenure' field and create a new column 'scaled_tenure'
df_preprocessed['scaled_tenure'] = scaler.fit_transform(df[['tenure']])
## Data Preprocessing: Encoding & Scaling
To prepare the dataset for machine learning, we convert categorical features into numerical form and scale numerical features for better model performance.
### Encoding Categorical Data
We apply different encoding techniques based on the feature type:
**Label Encoding (for binary features)**
- Applied to: SeniorCitizen, Partner, Dependents, PhoneService
- Why? These features have only two categories (Yes/No or 0/1), making label encoding the most efficient approach.
**One-Hot Encoding (for multi-category features)**
- Applied to: gender, MultipleLines, InternetService, OnlineSecurity, OnlineBackup
- Why? One-hot encoding creates a separate column for each category, so models do not impose an artificial order on non-ordinal data.
### Scaling Numerical Features
**MinMax Scaling (for tenure)**
- Why? Normalizes values to the range [0, 1], preventing tenure from dominating other features due to its larger scale.
Final step: the dataset is now fully encoded, normalized, and ready for model training!
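The min-max transform MinMaxScaler applies is simply x' = (x - min) / (max - min). A by-hand check on a few tenure values (using the dataset's observed min of 1 and max of 72 months) reproduces the same numbers:

```python
# Min-max scaling by hand: x' = (x - min) / (max - min)
tenure = [1, 2, 37, 72]  # sample tenure values; min=1 and max=72 as in the data
lo, hi = min(tenure), max(tenure)
scaled = [(x - lo) / (hi - lo) for x in tenure]
print([round(s, 6) for s in scaled])
# [0.0, 0.014085, 0.507042, 1.0]
```

These match the `scaled_tenure` values shown in the preview below (e.g. tenure 2.0 maps to 0.014085 and 37.0 to 0.507042).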
# Print confirmation
print("DataFrame `df_preprocessed` is ready for model training!")
df_preprocessed.head()
DataFrame `df_preprocessed` is ready for model training!
| | SeniorCitizen | Partner | Dependents | tenure | PhoneService | Churn | gender_Female | gender_Male | gender_Unknown | MultipleLines_No | ... | InternetService_DSL | InternetService_Fiber optic | InternetService_No | OnlineSecurity_No | OnlineSecurity_No internet service | OnlineSecurity_Yes | OnlineBackup_No | OnlineBackup_No internet service | OnlineBackup_Yes | scaled_tenure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 2.0 | 1 | 1 | 0 | 1 | 0 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0.014085 |
| 1 | 1 | 0 | 0 | 37.0 | 1 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0.507042 |
| 2 | 0 | 0 | 1 | 37.0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0.507042 |
| 3 | 1 | 0 | 0 | 13.0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0.169014 |
| 4 | 1 | 1 | 1 | 55.0 | 0 | 1 | 0 | 0 | 1 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0.760563 |
5 rows ร 22 columns
df_preprocessed.describe()
| | SeniorCitizen | Partner | Dependents | tenure | PhoneService | Churn | gender_Female | gender_Male | gender_Unknown | MultipleLines_No | ... | InternetService_DSL | InternetService_Fiber optic | InternetService_No | OnlineSecurity_No | OnlineSecurity_No internet service | OnlineSecurity_Yes | OnlineBackup_No | OnlineBackup_No internet service | OnlineBackup_Yes | scaled_tenure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | ... | 10000.000000 | 10000.0000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.00000 | 10000.000000 | 10000.000000 | 10000.000000 |
| mean | 0.550100 | 0.447700 | 0.548800 | 36.513900 | 0.541000 | 0.502000 | 0.449300 | 0.450700 | 0.100000 | 0.301900 | ... | 0.302000 | 0.4044 | 0.293600 | 0.300500 | 0.296600 | 0.402900 | 0.30380 | 0.398600 | 0.297600 | 0.500196 |
| std | 0.497509 | 0.497282 | 0.497638 | 19.630256 | 0.498341 | 0.500021 | 0.497448 | 0.497588 | 0.300015 | 0.459105 | ... | 0.459148 | 0.4908 | 0.455434 | 0.458498 | 0.456781 | 0.490506 | 0.45992 | 0.489635 | 0.457225 | 0.276482 |
| min | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 21.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.281690 |
| 50% | 1.000000 | 0.000000 | 1.000000 | 37.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.507042 |
| 75% | 1.000000 | 1.000000 | 1.000000 | 52.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | ... | 1.000000 | 1.0000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.00000 | 1.000000 | 1.000000 | 0.718310 |
| max | 1.000000 | 1.000000 | 1.000000 | 72.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 1.000000 | 1.0000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.00000 | 1.000000 | 1.000000 | 1.000000 |
8 rows ร 22 columns
import plotly.express as px
# Drop the 'tenure' column
filtered_df = df_preprocessed.drop(columns=['tenure'])
# Convert DataFrame to long format for Plotly
df_melted = filtered_df.melt(var_name='Feature', value_name='Value')
# Create an interactive box plot with thicker elements
fig = px.box(
df_melted,
x='Value',
y='Feature',
title="Box Plot of Features",
color='Feature', # Different colors for each feature
color_discrete_sequence=px.colors.qualitative.Prism # Color palette
)
# Increase thickness of box elements
fig.update_traces(
boxmean=True, # Show mean as a line inside the box
marker=dict(size=6), # Make outlier points bigger
line=dict(width=3) # Make box plot lines thicker
)
# Improve layout
fig.update_layout(
xaxis_title="Value Distribution",
yaxis_title="Features",
width=900,
height=500,
font=dict(family="Arial, sans-serif", size=12, color="black"),
margin=dict(l=100, r=50, t=50, b=50) # Adjust margins
)
fig.show()
## Machine Learning Models
- Creating models (Decision Tree, Logistic Regression, KNN, Random Forest)
- Evaluating performance (Accuracy, Precision, Recall, F1-score)
- Identifying important features for churn prediction
- Improving model performance with hyperparameter tuning
## Decision Tree Classifier: Hyperparameter Tuning & Evaluation
### 1. Manual Hyperparameter Tuning
- A Decision Tree Classifier is trained with manually set hyperparameters.
- The model is evaluated using Accuracy, Precision, Recall, and F1-Score to measure performance.
#Checking model building with manual tuning of hyperparameters - Decision Tree
# Decision Tree Classifier - Test Size = 0.2
x_dt = df_preprocessed.drop(['Churn', 'scaled_tenure'], axis=1)
y_dt = df_preprocessed['Churn']
# Split the data with test_size = 0.2
x_train_dt, x_test_dt, y_train_dt, y_test_dt = train_test_split(
x_dt, y_dt, test_size=0.2, random_state=42
)
# Initialize and fit the Decision Tree Classifier with the given hyperparameters (manual tuning)
dt_clf = DecisionTreeClassifier(
random_state=42,
criterion='entropy',
max_depth=7,
min_samples_leaf=1,
min_samples_split=2
)
dt_clf.fit(x_train_dt, y_train_dt)
# Make predictions
y_pred_dt = dt_clf.predict(x_test_dt)
# Evaluate performance
accuracy = accuracy_score(y_test_dt, y_pred_dt)
precision = precision_score(y_test_dt, y_pred_dt, pos_label=1)
recall = recall_score(y_test_dt, y_pred_dt, pos_label=1)
f1 = f1_score(y_test_dt, y_pred_dt, pos_label=1)
# Display results
print("\nResults of Decision Tree Classifier with Test Size = 0.2:")
print(f"Accuracy: {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1-Score: {f1:.3f}")
Results of Decision Tree Classifier with Test Size = 0.2:
Accuracy: 0.472
Precision: 0.481
Recall: 0.494
F1-Score: 0.487
### 2. Finding the Best Hyperparameters with GridSearchCV
- GridSearchCV is used to identify the best combination of hyperparameters.
- The search covers different values of max_depth, criterion, min_samples_split, and min_samples_leaf.
- Candidates are evaluated with 5-fold cross-validation using F1-score as the scoring metric.
#Finding the best hyperparameters for the Decision tree with Grid Search CV
x_dt = df_preprocessed.drop(['Churn','scaled_tenure'], axis=1)
y_dt = df_preprocessed['Churn']
# Split data into training and testing sets
x_train_dt, x_test_dt, y_train_dt, y_test_dt = train_test_split(x_dt, y_dt, test_size=0.2, random_state=42)
# Define the refined parameter grid
param_grid = {
'max_depth': [1, 2, 3, 5, 7], # Avoiding 'None' since deep trees overfit
'criterion': ['gini', 'entropy'],
'min_samples_split': [2, 3, 5],
'min_samples_leaf': [1, 2, 3]
}
# Initialize GridSearchCV with 5-fold cross-validation
grid_search_dt = GridSearchCV(
estimator=DecisionTreeClassifier(random_state=42),
param_grid=param_grid,
scoring='f1',
cv=5,
verbose=1,
n_jobs=-1
)
# Perform the grid search
grid_search_dt.fit(x_train_dt, y_train_dt)
# Retrieve the best model
best_clf = grid_search_dt.best_estimator_
# Make predictions on the test set using the best model
y_pred_dt = best_clf.predict(x_test_dt)
# Evaluate the best model
accuracy_dt = accuracy_score(y_test_dt, y_pred_dt)
precision_dt = precision_score(y_test_dt, y_pred_dt)
recall_dt = recall_score(y_test_dt, y_pred_dt)
f1_score_dt = f1_score(y_test_dt, y_pred_dt)
# Print the results
print("Best Parameters for Decision Tree Classifier:", grid_search_dt.best_params_)
print(f'Accuracy: {accuracy_dt:.3f}')
print(f'Precision: {precision_dt:.3f}')
print(f'Recall: {recall_dt:.3f}')
print(f'F1 Score: {f1_score_dt:.3f}')
Fitting 5 folds for each of 90 candidates, totalling 450 fits
Best Parameters for Decision Tree Classifier: {'criterion': 'entropy', 'max_depth': 5, 'min_samples_leaf': 3, 'min_samples_split': 2}
Accuracy: 0.476
Precision: 0.486
Recall: 0.544
F1 Score: 0.513
3️⃣ Evaluating the Best Model on Different Test Splits
- The best parameters from GridSearchCV are used to train and test models across different test sizes (0.1, 0.2, 0.3, 0.4).
- The performance of each model is compared using Accuracy, Precision, Recall, F1-Score, and AUC-ROC to analyze the impact of different test splits.
🔹 This process ensures that the model is well-optimized and generalizes effectively across different data splits.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc, classification_report, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
# Define features and target
x_dt = df_preprocessed.drop(['Churn', 'scaled_tenure'], axis=1)
y_dt = df_preprocessed['Churn']
# Define test sizes to evaluate
test_sizes = [0.1, 0.2, 0.3, 0.4]
# Store results for each test size
results_dt = []
for test_size in test_sizes:
# Split the data
x_train_dt, x_test_dt, y_train_dt, y_test_dt = train_test_split(
x_dt, y_dt, test_size=test_size, random_state=42
)
# Initialize and fit the Decision Tree Classifier with best parameters
dt_clf = DecisionTreeClassifier(**grid_search_dt.best_params_, random_state=42)  # fixed seed for reproducibility
dt_clf.fit(x_train_dt, y_train_dt)
# Make predictions
y_train_pred_dt = dt_clf.predict(x_train_dt)
y_test_pred_dt = dt_clf.predict(x_test_dt)
y_score_dt = dt_clf.predict_proba(x_test_dt)[:, 1] # Probability scores for ROC curve
# Compute evaluation metrics
train_accuracy = accuracy_score(y_train_dt, y_train_pred_dt)
test_accuracy = accuracy_score(y_test_dt, y_test_pred_dt)
accuracy = accuracy_score(y_test_dt, y_test_pred_dt)
precision = precision_score(y_test_dt, y_test_pred_dt, zero_division=1)
recall = recall_score(y_test_dt, y_test_pred_dt, zero_division=1)
f1 = f1_score(y_test_dt, y_test_pred_dt, zero_division=1)
# Compute ROC curve
fpr, tpr, _ = roc_curve(y_test_dt, y_score_dt)
roc_auc = auc(fpr, tpr)
# Store results
results_dt.append((test_size, accuracy, precision, recall, f1, roc_auc, train_accuracy, test_accuracy, y_test_dt, y_test_pred_dt, classification_report(y_test_dt, y_test_pred_dt), fpr, tpr))
# Display all results at the end
print("\nSummary of Results for Decision Tree Classifier:")
for i, (test_size, accuracy, precision, recall, f1, roc_auc, train_accuracy, test_accuracy, _, _, _, _, _) in enumerate(results_dt):
if i == 1: # Highlight the second record
print(
f"\033[1mTest Size: {test_size:.2f} | Accuracy: {accuracy:.3f}, Precision: {precision:.3f}, Recall: {recall:.3f}, "
f"F1-Score: {f1:.3f}, AUC-ROC: {roc_auc:.3f}\033[0m"
)
else:
print(
f"Test Size: {test_size:.2f} | Accuracy: {accuracy:.3f}, Precision: {precision:.3f}, Recall: {recall:.3f}, "
f"F1-Score: {f1:.3f}, AUC-ROC: {roc_auc:.3f}"
)
Summary of Results for Decision Tree Classifier:
Test Size: 0.10 | Accuracy: 0.487, Precision: 0.506, Recall: 0.474, F1-Score: 0.490, AUC-ROC: 0.494
Test Size: 0.20 | Accuracy: 0.476, Precision: 0.486, Recall: 0.544, F1-Score: 0.513, AUC-ROC: 0.466
Test Size: 0.30 | Accuracy: 0.498, Precision: 0.512, Recall: 0.525, F1-Score: 0.518, AUC-ROC: 0.494
Test Size: 0.40 | Accuracy: 0.504, Precision: 0.522, Recall: 0.463, F1-Score: 0.491, AUC-ROC: 0.505
# Extract values for plotting from Decision Tree results
test_sizes = [r[0] for r in results_dt]
train_accuracies = [r[6] for r in results_dt]
test_accuracies = [r[7] for r in results_dt]
# Plot training vs. validation accuracy
plt.figure(figsize=(6, 4))
plt.plot(test_sizes, train_accuracies, label="Training Accuracy", marker='o', linestyle='--', color='blue')
plt.plot(test_sizes, test_accuracies, label="Validation Accuracy", marker='s', linestyle='-', color='red')
plt.xlabel("Test Size")
plt.ylabel("Accuracy")
plt.title("Training vs. Validation Accuracy for Decision Tree")
plt.legend()
plt.grid(True)
plt.show()
# Extract results for the second test size (0.2) from Decision Tree results
(test_size, accuracy, precision, recall, f1, roc_auc, train_accuracy, test_accuracy,
y_test_dt, y_pred_dt, classification_report_dt, fpr, tpr) = results_dt[1]
# Print classification report and AUC-ROC
print(f"Classification Report for Decision Tree (Test Size: {test_size:.2f}):\n")
print(classification_report_dt)
print(f"\nAUC-ROC: {roc_auc:.3f}")
# Plot ROC Curve
plt.figure(figsize=(6, 4))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'AUC-ROC = {roc_auc:.3f}')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title(f'Decision Tree - ROC Curve (Test Size: {test_size:.2f})')
plt.legend(loc="lower right")
plt.grid()
plt.show()
# Compute confusion matrix
conf_matrix = confusion_matrix(y_test_dt, y_pred_dt)
# Display confusion matrix
print(f"\nConfusion Matrix for Decision Tree (Test Size: {test_size:.2f}):")
print(conf_matrix)
# Visualize confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=conf_matrix, display_labels=['Not Churned', 'Churned'])
disp.plot(cmap='Blues')
plt.title(f'Confusion Matrix - Decision Tree (Test Size: {test_size:.2f})')
plt.show()
Classification Report for Decision Tree (Test Size: 0.20):
precision recall f1-score support
0 0.46 0.41 0.43 985
1 0.49 0.54 0.51 1015
accuracy 0.48 2000
macro avg 0.48 0.48 0.47 2000
weighted avg 0.48 0.48 0.47 2000
AUC-ROC: 0.466
Confusion Matrix for Decision Tree (Test Size: 0.20):
[[401 584]
 [463 552]]
Logistic Regression Classifier - Hyperparameter Tuning & Evaluation
1️⃣ Manual Hyperparameter Tuning
- A Logistic Regression model is trained with manually set hyperparameters.
- The model is evaluated using Accuracy, Precision, Recall, F1-Score, and AUC-ROC to measure performance.
# Manual tuning of hyperparameters for Logistic Regression
# Logistic Regression Classifier - Test Size = 0.2
from sklearn.linear_model import LogisticRegression  # needed here; not among the earlier imports
# Drop raw tenure and keep scaled_tenure, since feature scaling matters for linear models
x_lr = df_preprocessed.drop(['Churn', 'tenure'], axis=1)
y_lr = df_preprocessed['Churn']
# Split the data with test_size = 0.2
x_train_lr, x_test_lr, y_train_lr, y_test_lr = train_test_split(
x_lr, y_lr, test_size=0.2, random_state=42
)
# Initialize and fit the Logistic Regression model
model = LogisticRegression(
random_state=42,
C=0.01,
l1_ratio=0.6,
max_iter=200,
penalty='elasticnet',
solver='saga'
)
model.fit(x_train_lr, y_train_lr)
# Make predictions
y_pred = model.predict(x_test_lr)
# Evaluate performance
accuracy_lr = accuracy_score(y_test_lr, y_pred)
precision_lr = precision_score(y_test_lr, y_pred, zero_division=1) # Handle undefined precision
recall_lr = recall_score(y_test_lr, y_pred, zero_division=1)
f1_lr = f1_score(y_test_lr, y_pred, zero_division=1)
# Display results
print("\nResults of Logistic Regression Classifier with Test Size = 0.2 & Manual tuning the hyperparameters")
print(f"Accuracy: {accuracy_lr:.3f}")
print(f"Precision: {precision_lr:.3f}")
print(f"Recall: {recall_lr:.3f}")
print(f"F1-Score: {f1_lr:.3f}")
Results of Logistic Regression Classifier with Test Size = 0.2 & Manual tuning the hyperparameters
Accuracy: 0.507
Precision: 0.507
Recall: 1.000
F1-Score: 0.673
2️⃣ Finding Best Hyperparameters with GridSearchCV
- GridSearchCV is used to optimize the `penalty`, `C`, `solver`, and `max_iter` values.
- The search is performed using 5-fold cross-validation with F1-score as the metric.
- Why F1-score?
  - While optimizing for Recall ensures we engage every potential churned customer, it can increase False Positives, leading to extra marketing costs.
  - F1-score balances Precision and Recall, ensuring we prioritize retention without excessive resource wastage.
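The trade-off can be seen on a toy example: a model that flags every customer as a churner gets perfect recall but mediocre precision, and F1 (the harmonic mean of the two) penalizes it. The labels below are illustrative only, not from the dataset:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy ground truth: 6 churners, 6 non-churners (illustrative values only)
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Model A flags everyone as a churner: perfect recall, poor precision
y_all_churn = [1] * 12
# Model B misses one churner but raises only one false alarm
y_balanced = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

for name, y_pred in [("all-churn", y_all_churn), ("balanced", y_balanced)]:
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"{name}: precision={p:.2f}, recall={r:.2f}, f1={f1_score(y_true, y_pred):.2f}")
```

Despite its lower recall, the balanced model wins on F1, which is why F1 is used as the scoring metric in the searches below.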
# Finding the best hyperparameters for Logistic Regression
x_lr = df_preprocessed.drop(['Churn', 'tenure'], axis=1)
y_lr = df_preprocessed['Churn']
# Split data into training and testing sets
x_train_lr, x_test_lr, y_train_lr, y_test_lr = train_test_split(x_lr, y_lr, test_size=0.2, random_state=42)
# Define the parameter grid for Logistic Regression
param_grid_lr = [
{'penalty': ['l1'], 'C': [0.05, 0.1, 1, 10], 'solver': ['liblinear'], 'max_iter': [50, 100, 200, 500]},
{'penalty': ['l2'], 'C': [0.05, 0.1, 1, 10], 'solver': ['liblinear', 'saga'], 'max_iter': [50, 100, 200, 500]},
{'penalty': ['elasticnet'], 'C': [0.05, 0.1, 1, 10], 'solver': ['saga'], 'l1_ratio': [0.5], 'max_iter': [50, 100, 200, 500]}
]
# Initialize GridSearchCV
grid_search_lr = GridSearchCV(
estimator=LogisticRegression(random_state=42),
param_grid=param_grid_lr,
scoring='f1',
cv=5,
verbose=1,
n_jobs=-1
)
# Perform the grid search
grid_search_lr.fit(x_train_lr, y_train_lr)
# Retrieve the best model from the search
best_lr_clf = grid_search_lr.best_estimator_
# Make predictions on the test set using the best model
y_pred_lr = best_lr_clf.predict(x_test_lr)
# Evaluate the best model
accuracy_lr = accuracy_score(y_test_lr, y_pred_lr)
precision_lr = precision_score(y_test_lr, y_pred_lr)
recall_lr = recall_score(y_test_lr, y_pred_lr)
f1_score_lr = f1_score(y_test_lr, y_pred_lr)
# Print the results
print("Best Parameters for Logistic Regression Classifier:", grid_search_lr.best_params_)
print(f'Accuracy: {accuracy_lr:.3f}')
print(f'Precision: {precision_lr:.3f}')
print(f'Recall: {recall_lr:.3f}')
print(f'F1 Score: {f1_score_lr:.3f}')
Fitting 5 folds for each of 64 candidates, totalling 320 fits
Best Parameters for Logistic Regression Classifier: {'C': 0.05, 'max_iter': 50, 'penalty': 'l1', 'solver': 'liblinear'}
Accuracy: 0.508
Precision: 0.515
Recall: 0.549
F1 Score: 0.531
3️⃣ Evaluating the Best Model on Different Test Splits
- The best parameters from GridSearchCV are used to train and test models across different test sizes (0.1, 0.2, 0.3, 0.4).
- The performance of each model is compared using Accuracy, Precision, Recall, F1-Score, and AUC-ROC to analyze the impact of different test splits.
🔹 This process ensures that the model is well-optimized, achieves high recall, and generalizes effectively across different data splits.
# Define features and target
x_lr = df_preprocessed.drop(['Churn', 'tenure'], axis=1)
y_lr = df_preprocessed['Churn']
# Define possible test sizes
test_sizes = [0.1, 0.2, 0.3, 0.4]
# Store results for each test size
results_lr = []
for test_size in test_sizes:
# Split the data
x_train_lr, x_test_lr, y_train_lr, y_test_lr = train_test_split(
x_lr, y_lr, test_size=test_size, random_state=42
)
# Initialize and fit the Logistic Regression model with best parameters
model = LogisticRegression(**grid_search_lr.best_params_, random_state=42)  # fixed seed for reproducibility
model.fit(x_train_lr, y_train_lr)
# Make predictions
y_train_pred_lr = model.predict(x_train_lr)
y_test_pred_lr = model.predict(x_test_lr)
y_score_lr = model.predict_proba(x_test_lr)[:, 1] # Probability scores for ROC curve
# Compute evaluation metrics
train_accuracy = accuracy_score(y_train_lr, y_train_pred_lr)
test_accuracy = accuracy_score(y_test_lr, y_test_pred_lr)
accuracy = accuracy_score(y_test_lr, y_test_pred_lr)
precision = precision_score(y_test_lr, y_test_pred_lr, zero_division=1)
recall = recall_score(y_test_lr, y_test_pred_lr, zero_division=1)
f1 = f1_score(y_test_lr, y_test_pred_lr, zero_division=1)
# Compute ROC curve
fpr, tpr, _ = roc_curve(y_test_lr, y_score_lr)
roc_auc = auc(fpr, tpr)
# Store results
results_lr.append((test_size, accuracy, precision, recall, f1, roc_auc, train_accuracy, test_accuracy, y_test_lr, y_test_pred_lr, classification_report(y_test_lr, y_test_pred_lr),fpr, tpr))
# Display all results at the end
print("\nSummary of Results for Logistic Regression Classifier:")
for i, (test_size, accuracy, precision, recall, f1, roc_auc, train_accuracy, test_accuracy, _, _, _, _, _) in enumerate(results_lr):
if i == 1: # Highlight the second record
print(
f"\033[1mTest Size: {test_size:.2f} | Accuracy: {accuracy:.3f}, Precision: {precision:.3f}, Recall: {recall:.3f}, "
f"F1-Score: {f1:.3f}, AUC-ROC: {roc_auc:.3f}\033[0m"
)
else:
print(
f"Test Size: {test_size:.2f} | Accuracy: {accuracy:.3f}, Precision: {precision:.3f}, Recall: {recall:.3f}, "
f"F1-Score: {f1:.3f}, AUC-ROC: {roc_auc:.3f}"
)
Summary of Results for Logistic Regression Classifier:
Test Size: 0.10 | Accuracy: 0.486, Precision: 0.505, Recall: 0.520, F1-Score: 0.512, AUC-ROC: 0.500
Test Size: 0.20 | Accuracy: 0.508, Precision: 0.515, Recall: 0.549, F1-Score: 0.531, AUC-ROC: 0.515
Test Size: 0.30 | Accuracy: 0.502, Precision: 0.520, Recall: 0.403, F1-Score: 0.454, AUC-ROC: 0.509
Test Size: 0.40 | Accuracy: 0.502, Precision: 0.552, Recall: 0.191, F1-Score: 0.284, AUC-ROC: 0.507
# Extract values for plotting
test_sizes = [r[0] for r in results_lr]
train_accuracies = [r[6] for r in results_lr]
test_accuracies = [r[7] for r in results_lr]
# Plot training vs. validation accuracy
plt.figure(figsize=(6, 4))
plt.plot(test_sizes, train_accuracies, label="Training Accuracy", marker='o', linestyle='--', color='blue')
plt.plot(test_sizes, test_accuracies, label="Validation Accuracy", marker='s', linestyle='-', color='red')
plt.xlabel("Test Size")
plt.ylabel("Accuracy")
plt.title("Training vs. Validation Accuracy for Logistic Regression")
plt.legend()
plt.grid(True)
plt.show()
# Extract results for the second test size (0.2)
test_size, accuracy, precision, recall, f1, roc_auc, train_accuracy, test_accuracy, y_test_lr, y_pred_lr, classification_report_lr, fpr, tpr = results_lr[1]
# Print classification report and AUC-ROC
print(f"Classification Report for Logistic Regression (Test Size: {test_size:.2f}):\n")
print(classification_report_lr)
print(f"\nAUC-ROC: {roc_auc:.3f}")
# Plot ROC Curve
plt.figure(figsize=(6, 4))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'AUC-ROC = {roc_auc:.3f}')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title(f'Logistic Regression - ROC Curve (Test Size: {test_size:.2f})')
plt.legend(loc="lower right")
plt.grid()
plt.show()
# Compute confusion matrix
conf_matrix = confusion_matrix(y_test_lr, y_pred_lr)
# Display confusion matrix
print(f"\nConfusion Matrix for Logistic Regression (Test Size: {test_size:.2f}):")
print(conf_matrix)
# Visualize confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=conf_matrix, display_labels=['Not Churned', 'Churned'])
disp.plot(cmap='Blues')
plt.title(f'Confusion Matrix - Logistic Regression (Test Size: {test_size:.2f})')
plt.show()
Classification Report for Logistic Regression (Test Size: 0.20):
precision recall f1-score support
0 0.50 0.47 0.48 985
1 0.51 0.55 0.53 1015
accuracy 0.51 2000
macro avg 0.51 0.51 0.51 2000
weighted avg 0.51 0.51 0.51 2000
AUC-ROC: 0.515
Confusion Matrix for Logistic Regression (Test Size: 0.20):
[[460 525]
 [458 557]]
Comparison of Logistic Regression and Decision Tree Models
Performance Metrics
| Metric | Logistic Regression | Decision Tree |
|---|---|---|
| Accuracy | 0.508 | 0.476 |
| Precision | 0.515 | 0.486 |
| Recall | 0.549 | 0.544 |
| F1-Score | 0.531 | 0.513 |
| AUC-ROC | 0.515 | 0.466 |
Key Insights
1. Logistic Regression is the Stronger Model
- Higher Accuracy (0.508 vs. 0.476) → better overall classification.
- Higher Precision (0.515 vs. 0.486) → fewer false positives, leading to more reliable predictions.
- Higher AUC-ROC (0.515 vs. 0.466) → better at distinguishing between classes.
2. Decision Tree Still Captures a Lot of Churners
- Recall is close (0.544 vs. 0.549) → detects nearly as many true churners as Logistic Regression.
- Lower precision means it produces more false positives, which may not be ideal for cost-sensitive decisions.
3. Logistic Regression Has a Better F1-Score (0.531 vs. 0.513)
- More balanced between precision and recall, making it the better overall classifier.
Final Verdict
🔹 Logistic Regression is the clear winner, with superior accuracy, precision, recall, and AUC-ROC.
🔹 Decision Tree may still be useful when prioritizing recall, but it struggles with precision and overall performance.
🔹 Both models can be improved: we should explore feature engineering, hyperparameter tuning, and advanced models.
Next Steps:
We'll now evaluate K-Nearest Neighbors and Ensemble Methods to improve model performance!
Assignment part 2
K-Nearest Neighbors Classifier - Evaluation
# K-Nearest Neighbors Classifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, classification_report, roc_curve, auc, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
# Define features and target
x_knn = df_preprocessed.drop(['Churn', 'tenure'], axis=1)
y_knn = df_preprocessed['Churn']
# Define possible test sizes
test_sizes = [0.1, 0.2, 0.3, 0.4]
# Store results for each test size
results_knn = []
for test_size in test_sizes:
# Split the data
x_train_knn, x_test_knn, y_train_knn, y_test_knn = train_test_split(
x_knn, y_knn, test_size=test_size, random_state=42
)
# Initialize the KNN model with specified hyperparameters (Hyperparameters are taken from the grid search)
knn_model = KNeighborsClassifier(
metric='euclidean',
n_neighbors=5,
weights='distance'
)
# Fit the model
knn_model.fit(x_train_knn, y_train_knn)
# Make predictions
y_pred_knn = knn_model.predict(x_test_knn)
y_score_knn = knn_model.predict_proba(x_test_knn)[:, 1]
# Evaluate performance
accuracy = accuracy_score(y_test_knn, y_pred_knn)
precision = precision_score(y_test_knn, y_pred_knn, zero_division=1)
recall = recall_score(y_test_knn, y_pred_knn, zero_division=1)
f1 = f1_score(y_test_knn, y_pred_knn, zero_division=1)
auc_roc = roc_auc_score(y_test_knn, y_score_knn)
classification_report_knn = classification_report(y_test_knn, y_pred_knn)
# Compute ROC curve
fpr, tpr, _ = roc_curve(y_test_knn, y_score_knn)
# Store the results
results_knn.append((test_size, accuracy, precision, recall, f1, auc_roc, y_test_knn, y_pred_knn, classification_report_knn, fpr, tpr))
# Display all results at the end
print("\nSummary of Results for K-Nearest Neighbors Classifier:")
for i, (test_size, accuracy, precision, recall, f1, auc_roc, _, _, _, _, _) in enumerate(results_knn):
if i == 1: # Highlight the second record
print(
f"\033[1mTest Size: {test_size:.2f} | Accuracy: {accuracy:.3f}, Precision: {precision:.3f}, Recall: {recall:.3f}, F1-Score: {f1:.3f}, AUC-ROC: {auc_roc:.3f}\033[0m")
else:
print(
f"Test Size: {test_size:.2f} | Accuracy: {accuracy:.3f}, Precision: {precision:.3f}, Recall: {recall:.3f}, F1-Score: {f1:.3f}, AUC-ROC: {auc_roc:.3f}")
Summary of Results for K-Nearest Neighbors Classifier:
Test Size: 0.10 | Accuracy: 0.489, Precision: 0.508, Recall: 0.487, F1-Score: 0.498, AUC-ROC: 0.500
Test Size: 0.20 | Accuracy: 0.495, Precision: 0.503, Recall: 0.495, F1-Score: 0.499, AUC-ROC: 0.497
Test Size: 0.30 | Accuracy: 0.490, Precision: 0.504, Recall: 0.489, F1-Score: 0.496, AUC-ROC: 0.490
Test Size: 0.40 | Accuracy: 0.491, Precision: 0.507, Recall: 0.490, F1-Score: 0.499, AUC-ROC: 0.494
# Extract values for plotting
test_sizes = [r[0] for r in results_knn]
accuracies = [r[1] for r in results_knn]
# Plot test accuracy across splits (results_knn stores only test metrics,
# so there is no training-accuracy curve for KNN)
plt.figure(figsize=(6, 4))
plt.plot(test_sizes, accuracies, label="Test Accuracy", marker='o', linestyle='-', color='blue')
plt.xlabel("Test Size")
plt.ylabel("Accuracy")
plt.title("Accuracy vs. Test Size for KNN")
plt.legend()
plt.grid(True)
plt.show()
# Extract results for the second test size (0.2) from KNN results
test_size, accuracy, precision, recall, f1, auc_roc, y_test_knn, y_pred_knn, classification_report_knn, fpr, tpr = results_knn[1]
# Print the classification report and AUC-ROC
print("Classification Report for K-Nearest Neighbors (Test Size: {:.2f}):\n".format(test_size))
print(classification_report_knn)
print("\nAUC-ROC: {:.3f}".format(auc_roc))
plt.figure(figsize=(6, 4))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'AUC-ROC = {auc_roc:.3f}')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('K-Nearest Neighbors - ROC Curve (Test Size: {:.2f})'.format(test_size))
plt.legend(loc="lower right")
plt.grid()
plt.show()
# Compute confusion matrix
conf_matrix = confusion_matrix(y_test_knn, y_pred_knn)
# Display confusion matrix
print("\nConfusion Matrix for K-Nearest Neighbors (Test Size: {:.2f}):".format(test_size))
print(conf_matrix)
# Visualize confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=conf_matrix, display_labels=['Not Churned', 'Churned'])
disp.plot(cmap='Blues')
plt.title('Confusion Matrix - K-Nearest Neighbors (Test Size: {:.2f})'.format(test_size))
plt.show()
Classification Report for K-Nearest Neighbors (Test Size: 0.20):
precision recall f1-score support
0 0.49 0.50 0.49 985
1 0.50 0.49 0.50 1015
accuracy 0.49 2000
macro avg 0.50 0.50 0.49 2000
weighted avg 0.50 0.49 0.50 2000
AUC-ROC: 0.497
Confusion Matrix for K-Nearest Neighbors (Test Size: 0.20):
[[488 497]
 [513 502]]
Ensemble Method - Random Forest Classifier - Evaluation
# Random Forest Classifier (Ensemble Method)
# Feature matrix and target variable
x_rf = df_preprocessed.drop(['Churn', 'tenure'], axis=1)
y_rf = df_preprocessed['Churn']
# Define possible test sizes
test_sizes = [0.1, 0.2, 0.3, 0.4]
# Store results for each test size
results_rf = []
for test_size in test_sizes:
# Split the data
x_train_rf, x_test_rf, y_train_rf, y_test_rf = train_test_split(
x_rf, y_rf, test_size=test_size, random_state=42
)
# Initialize Random Forest model with specified hyperparameters with manual tuning
rf_model = RandomForestClassifier(
n_estimators=100, # Number of trees in the forest
max_depth=None, # Maximum depth of the tree (None means nodes expand until all leaves are pure)
random_state=42, # Random seed for reproducibility
bootstrap=True, # Bagging enabled
)
# Fit the model on the training data
rf_model.fit(x_train_rf, y_train_rf)
# Predict on training and test data
y_train_pred_rf = rf_model.predict(x_train_rf)
y_test_pred_rf = rf_model.predict(x_test_rf)
# Evaluate performance
train_accuracy = accuracy_score(y_train_rf, y_train_pred_rf)
test_accuracy = accuracy_score(y_test_rf, y_test_pred_rf)
accuracy = accuracy_score(y_test_rf, y_test_pred_rf)
precision = precision_score(y_test_rf, y_test_pred_rf, zero_division=1)
recall = recall_score(y_test_rf, y_test_pred_rf, zero_division=1)
f1 = f1_score(y_test_rf, y_test_pred_rf, zero_division=1)
classification_report_rf = classification_report(y_test_rf, y_test_pred_rf)
# Compute and plot AUC-ROC curve
y_score_rf = rf_model.predict_proba(x_test_rf)[:, 1]
fpr, tpr, _ = roc_curve(y_test_rf, y_score_rf)
roc_auc = auc(fpr, tpr)
# Store results (including fpr/tpr so the ROC curve for any split can be plotted later)
results_rf.append((test_size, accuracy, precision, recall, f1, roc_auc, train_accuracy, test_accuracy, y_test_rf, y_test_pred_rf, classification_report_rf, fpr, tpr))
# Display all results at the end
print("\nSummary of Results for Random Forest Classifier (Bagging):")
for i, (test_size, accuracy, precision, recall, f1, roc_auc, train_accuracy, test_accuracy, y_test_rf, y_test_pred_rf, classification_report_rf, fpr, tpr) in enumerate(results_rf):
if i == 1: # Highlight the second record
print(
f"\033[1mTest Size: {test_size:.2f} | Accuracy: {accuracy:.3f}, Precision: {precision:.3f}, Recall: {recall:.3f}, F1-Score: {f1:.3f}, AUC-ROC: {roc_auc:.3f}\033[0m"
)
else:
print(
f"Test Size: {test_size:.2f} | Accuracy: {accuracy:.3f}, Precision: {precision:.3f}, Recall: {recall:.3f}, F1-Score: {f1:.3f}, AUC-ROC: {roc_auc:.3f}"
)
Summary of Results for Random Forest Classifier (Bagging):
Test Size: 0.10 | Accuracy: 0.483, Precision: 0.502, Recall: 0.493, F1-Score: 0.498, AUC-ROC: 0.491
Test Size: 0.20 | Accuracy: 0.483, Precision: 0.491, Recall: 0.482, F1-Score: 0.486, AUC-ROC: 0.494
Test Size: 0.30 | Accuracy: 0.489, Precision: 0.503, Recall: 0.481, F1-Score: 0.492, AUC-ROC: 0.494
Test Size: 0.40 | Accuracy: 0.493, Precision: 0.509, Recall: 0.485, F1-Score: 0.497, AUC-ROC: 0.499
# Extract values for plotting
test_sizes = [r[0] for r in results_rf]
train_accuracies = [r[6] for r in results_rf]
test_accuracies = [r[7] for r in results_rf]
# Plot training vs. validation accuracy
plt.figure(figsize=(6, 4))
plt.plot(test_sizes, train_accuracies, label="Training Accuracy", marker='o', linestyle='--', color='blue')
plt.plot(test_sizes, test_accuracies, label="Validation Accuracy", marker='s', linestyle='-', color='red')
plt.xlabel("Test Size")
plt.ylabel("Accuracy")
plt.title("Training vs. Validation Accuracy for Random Forest")
plt.legend()
plt.grid(True)
plt.show()
# Extract results for the second test size (0.2)
test_size, accuracy, precision, recall, f1, roc_auc, train_accuracy, test_accuracy, y_test_rf, y_test_pred_rf, classification_report_rf, fpr, tpr = results_rf[1]
# Print the classification report and AUC-ROC
print("Classification Report for Random forest (Test Size: {:.2f}):\n".format(test_size))
print(classification_report_rf)
print("\nAUC-ROC: {:.3f}".format(roc_auc))
# Plot the ROC curve using the fpr/tpr stored for this split
plt.figure(figsize=(6, 4))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'AUC-ROC = {roc_auc:.3f}')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Random forest - ROC Curve (Test Size: {:.2f})'.format(test_size))
plt.legend(loc="lower right")
plt.grid()
plt.show()
# Compute confusion matrix
conf_matrix = confusion_matrix(y_test_rf, y_test_pred_rf)
# Display confusion matrix
print("\nConfusion Matrix for Random forest (Test Size: {:.2f}):".format(test_size))
print(conf_matrix)
# Visualize confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=conf_matrix, display_labels=['Not Churned', 'Churned'])
disp.plot(cmap='Blues')
plt.title('Confusion Matrix - Random forest (Test Size: {:.2f})'.format(test_size))
plt.show()
Classification Report for Random forest (Test Size: 0.20):
precision recall f1-score support
0 0.48 0.49 0.48 985
1 0.49 0.48 0.49 1015
accuracy 0.48 2000
macro avg 0.48 0.48 0.48 2000
weighted avg 0.48 0.48 0.48 2000
AUC-ROC: 0.494
Confusion Matrix for Random forest (Test Size: 0.20):
[[478 507]
 [526 489]]
# Performance comparison of the models
# Data
models = ['Decision Tree', 'Logistic Regression', 'KNN', 'Random Forest']
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'AUC-ROC']
# Extract the second (index 1) results for each model and round to 3 decimal places
data = {
'Decision Tree': [round(metric, 3) for metric in results_dt[1][1:6]], # Skip the test size (1st item)
'Logistic Regression': [round(metric, 3) for metric in results_lr[1][1:6]], # Skip the test size
'KNN': [round(metric, 3) for metric in results_knn[1][1:6]], # Skip the test size
'Random Forest': [round(metric, 3) for metric in results_rf[1][1:6]] # Skip the test size
}
# Print the extracted data dictionary (rounded to 3 decimal places)
print(data)
# Convert to DataFrame
df_models = pd.DataFrame(data, index=metrics)
# Transform DataFrame into long format for Plotly
df_melted = df_models.reset_index().melt(id_vars='index', var_name='Model', value_name='Score')
df_melted.rename(columns={'index': 'Metric'}, inplace=True)
# Plot grouped bars with Plotly (px.bar suits these pre-aggregated scores)
fig = px.bar(
df_melted,
x='Metric', # Metrics on the x-axis
y='Score', # Scores on the y-axis
color='Model', # Grouped by models
barmode='group', # Bars grouped side-by-side
title='Model Performance Comparison', # Title of the chart
color_discrete_sequence=px.colors.qualitative.Prism # Define color palette
)
# Customize layout
fig.update_layout(
xaxis_title='Evaluation Metrics',
yaxis_title='Score',
width=1000,
height=500,
legend_title='Models'
)
# Show the interactive plot
fig.show()
{'Decision Tree': [0.476, 0.486, 0.544, 0.513, 0.466], 'Logistic Regression': [0.508, 0.515, 0.549, 0.531, 0.515], 'KNN': [0.495, 0.503, 0.495, 0.499, 0.497], 'Random Forest': [0.483, 0.491, 0.482, 0.486, 0.494]}
Customer Churn Prediction Model Evaluation
Problem Statement
Predicting customer churn is crucial for telecom companies to retain customers and reduce revenue loss. Churn occurs when customers discontinue services, impacting business sustainability. By accurately predicting churn, companies can implement targeted retention strategies such as personalized offers, better customer service, and proactive engagement.
Overall Model Performance
All models exhibit relatively low performance, with accuracy scores hovering around 50%. This suggests potential challenges in the dataset, such as:
🔹 High noise → irrelevant or inconsistent data
🔹 Weak predictive features → few strong indicators of churn
🔹 Class imbalance → disproportionate churn vs. non-churn cases
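The imbalance hypothesis is easy to check directly. A minimal sketch, using a toy Series as a stand-in (in the notebook the same check would run on `df_preprocessed['Churn']`):

```python
import pandas as pd

# Toy stand-in for the notebook's target column, sized like the 0.2 test split
churn = pd.Series([1] * 1015 + [0] * 985)

# Normalized class frequencies: values near 0.50/0.50 indicate a balanced target
print(churn.value_counts(normalize=True))
```

The test-split supports reported above (985 non-churners vs. 1015 churners) suggest the classes are in fact close to balanced, which points toward weak features or noise rather than imbalance.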
However, even slight improvements over random guessing (50%) can translate into significant business impact, making these insights valuable for retention efforts.
Models Evaluated & Metrics
We trained and tested four machine learning models to predict churn:
✅ Decision Tree
✅ Logistic Regression
✅ K-Nearest Neighbors (KNN)
✅ Random Forest
Each model was evaluated using:
- Accuracy → overall correctness of the model.
- Precision → percentage of predicted churners that actually churned.
- Recall → percentage of actual churners correctly identified.
- F1-Score → balances precision and recall for overall effectiveness.
- AUC-ROC → measures the model's ability to distinguish churners from non-churners.
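All five metrics come straight from scikit-learn; a minimal sketch on toy labels (illustrative values only) shows the pattern, including that AUC-ROC is computed from probability scores rather than hard class predictions:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Toy ground truth and model outputs (illustrative only)
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0, 0, 1]                   # hard class predictions
y_prob = [0.2, 0.6, 0.7, 0.8, 0.3, 0.4, 0.1, 0.9]  # predicted churn probabilities

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.3f}")
print(f"Precision: {precision_score(y_true, y_pred):.3f}")
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")
print(f"F1-Score:  {f1_score(y_true, y_pred):.3f}")
print(f"AUC-ROC:   {roc_auc_score(y_true, y_prob):.3f}")  # uses probabilities, not labels
```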
Performance Comparison
| Model | Accuracy | Precision | Recall | F1-Score | AUC-ROC |
|---|---|---|---|---|---|
| Decision Tree | 0.476 | 0.486 | 0.544 | 0.513 | 0.466 |
| Logistic Regression | 0.508 | 0.515 | 0.549 | 0.531 | 0.515 |
| KNN | 0.495 | 0.503 | 0.495 | 0.499 | 0.497 |
| Random Forest | 0.483 | 0.491 | 0.482 | 0.486 | 0.494 |
Best Model Selection: Logistic Regression
Among the evaluated models, Logistic Regression achieves the highest accuracy (0.508), recall (0.549), and AUC-ROC (0.515), making it the best performer.
🔹 Why Logistic Regression?
✅ Highest Recall (0.549) and AUC-ROC (0.515) → outperforms Decision Tree, KNN, and Random Forest in distinguishing churners from non-churners.
✅ Best Balance Between Precision and Recall → ensures a good trade-off between correctly identifying churners and minimizing false positives.
✅ More Consistent Performance → unlike KNN and Random Forest, which have lower recall, Logistic Regression captures more churners effectively.
Business Impact
Churn prediction is a trade-off between recall and precision:
🔹 Decision Tree prioritizes recall, meaning it catches more churners but misclassifies more loyal customers, increasing unnecessary interventions.
🔹 Logistic Regression balances recall and precision, making it a reliable alternative.
🔹 KNN and Random Forest perform worse overall, with lower recall and accuracy.
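The trade-off can be made concrete with a back-of-the-envelope cost model. All monetary figures below are hypothetical assumptions, not taken from the dataset:

```python
# Hypothetical campaign economics (illustrative assumptions only)
offer_cost = 40.0    # cost of one retention offer per targeted customer
saved_value = 100.0  # value retained when a true churner is kept
n_churners = 1000    # churners in the customer pool

def expected_profit(precision, recall):
    """Expected profit of targeting every customer the model flags as a churner."""
    true_pos = recall * n_churners   # churners correctly targeted
    flagged = true_pos / precision   # total customers receiving the offer
    return true_pos * saved_value - flagged * offer_cost

# A recall-first model (flag nearly everyone, precision near the churn rate)
# vs. a more balanced model
print(f"recall-first: {expected_profit(precision=0.5, recall=1.0):,.0f}")
print(f"balanced:     {expected_profit(precision=0.75, recall=0.8):,.0f}")
```

Under these assumed numbers the balanced model yields a higher expected profit even though it reaches fewer churners; cheaper offers would shift the balance back toward the recall-first strategy.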
Considerations & Next Steps
✅ If recall is the top priority, Decision Tree remains a good choice.
✅ For interpretability and computational efficiency, Logistic Regression is preferable.
✅ Exploring ensemble methods and feature engineering could further improve predictions.
Conclusion
For customer churn prediction, Logistic Regression emerges as the best model due to its superior accuracy, recall, and AUC-ROC score. Further refinements in feature selection and hyperparameter tuning could improve overall performance for better business impact.
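As a concrete next step, the GridSearchCV pattern used above for the Decision Tree and Logistic Regression could also be applied to the Random Forest, whose hyperparameters were only tuned manually. A minimal sketch; the grid values are illustrative assumptions, and synthetic data stands in for the notebook's `x_rf`/`y_rf`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the notebook's x_rf / y_rf
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Illustrative grid; a real search would widen these ranges
param_grid_rf = {
    'n_estimators': [50, 100],
    'max_depth': [3, 5, None],
    'min_samples_leaf': [1, 3],
}

grid_search_rf = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid_rf,
    scoring='f1',  # consistent with the earlier searches
    cv=5,
    n_jobs=-1,
)
grid_search_rf.fit(X, y)

print("Best Parameters for Random Forest:", grid_search_rf.best_params_)
print(f"Best CV F1: {grid_search_rf.best_score_:.3f}")
```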